## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity" "citric.acid"
## [5] "residual.sugar" "chlorides" "free.sulfur.dioxide" "total.sulfur.dioxide"
## [9] "density" "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871 Min. :2.720
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090
## Median :0.04300 Median : 34.00 Median :134.0 Median :0.9937 Median :3.180
## Mean :0.04577 Mean : 35.31 Mean :138.4 Mean :0.9940 Mean :3.188
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280
## Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.0390 Max. :3.820
## sulphates alcohol quality
## Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:0.4100 1st Qu.: 9.50 1st Qu.:5.000
## Median :0.4700 Median :10.40 Median :6.000
## Mean :0.4898 Mean :10.51 Mean :5.878
## 3rd Qu.:0.5500 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :1.0800 Max. :14.20 Max. :9.000
## vars n mean sd median trimmed mad min max range skew
## X 1 4898 2449.50 1414.08 2449.50 2449.50 1815.44 1.00 4898.00 4897.00 0.00
## fixed.acidity 2 4898 6.85 0.84 6.80 6.82 0.74 3.80 14.20 10.40 0.65
## volatile.acidity 3 4898 0.28 0.10 0.26 0.27 0.09 0.08 1.10 1.02 1.58
## citric.acid 4 4898 0.33 0.12 0.32 0.33 0.09 0.00 1.66 1.66 1.28
## residual.sugar 5 4898 6.39 5.07 5.20 5.80 5.34 0.60 65.80 65.20 1.08
## chlorides 6 4898 0.05 0.02 0.04 0.04 0.01 0.01 0.35 0.34 5.02
## free.sulfur.dioxide 7 4898 35.31 17.01 34.00 34.36 16.31 2.00 289.00 287.00 1.41
## total.sulfur.dioxide 8 4898 138.36 42.50 134.00 136.96 43.00 9.00 440.00 431.00 0.39
## density 9 4898 0.99 0.00 0.99 0.99 0.00 0.99 1.04 0.05 0.98
## pH 10 4898 3.19 0.15 3.18 3.18 0.15 2.72 3.82 1.10 0.46
## sulphates 11 4898 0.49 0.11 0.47 0.48 0.10 0.22 1.08 0.86 0.98
## alcohol 12 4898 10.51 1.23 10.40 10.43 1.48 8.00 14.20 6.20 0.49
## quality 13 4898 5.88 0.89 6.00 5.85 1.48 3.00 9.00 6.00 0.16
## kurtosis se
## X -1.20 20.21
## fixed.acidity 2.17 0.01
## volatile.acidity 5.08 0.00
## citric.acid 6.16 0.00
## residual.sugar 3.46 0.07
## chlorides 37.51 0.00
## free.sulfur.dioxide 11.45 0.24
## total.sulfur.dioxide 0.57 0.61
## density 9.78 0.00
## pH 0.53 0.00
## sulphates 1.59 0.00
## alcohol -0.70 0.02
## quality 0.21 0.01
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
## 1 1 7.0 0.27 0.36 20.7 0.045 45
## 2 2 6.3 0.30 0.34 1.6 0.049 14
## 3 3 8.1 0.28 0.40 6.9 0.050 30
## 4 4 7.2 0.23 0.32 8.5 0.058 47
## 5 5 7.2 0.23 0.32 8.5 0.058 47
## 6 6 8.1 0.28 0.40 6.9 0.050 30
## total.sulfur.dioxide density pH sulphates alcohol quality
## 1 170 1.0010 3.00 0.45 8.8 6
## 2 132 0.9940 3.30 0.49 9.5 6
## 3 97 0.9951 3.26 0.44 10.1 6
## 4 186 0.9956 3.19 0.40 9.9 6
## 5 186 0.9956 3.19 0.40 9.9 6
## 6 97 0.9951 3.26 0.44 10.1 6
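The output above looks like the result of `dim()`, `names()`, `str()`, `summary()`, a `describe()` call (the column layout resembles `psych::describe`, which is an assumption) and `head()`. As a minimal sketch, the block below rebuilds only the six rows printed by `head()` into a data frame named `wines_head` (a hypothetical name) and runs the base R inspection calls on it. The summary statistics it prints will therefore differ from the full 4898-row results.

```r
# Rebuild the six rows printed by head() above; `wines_head` is a
# hypothetical name and covers only this excerpt, not the full data set.
wines_head <- data.frame(
  X                    = 1:6,
  fixed.acidity        = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058, 0.050),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97),
  density              = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40, 0.44),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality              = rep(6L, 6)
)

# Base R inspection calls matching the kinds of output shown above.
print(dim(wines_head))    # rows and columns
print(names(wines_head))  # column names
str(wines_head)           # types and leading values
print(summary(wines_head))
print(head(wines_head))
```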
Omitting some features that appeared less interesting for brevity:
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
##
## L M H
## 183 3655 1060
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 6.85 0.84 6.8 6.82 0.74 3.8 14.2 10.4 0.65 2.17 0.01
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 0.28 0.1 0.26 0.27 0.09 0.08 1.1 1.02 1.58 5.08 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 6.39 5.07 5.2 5.8 5.34 0.6 65.8 65.2 1.08 3.46 0.07
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 0.05 0.02 0.04 0.04 0.01 0.01 0.35 0.34 5.02 37.51 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 138.36 42.5 134 136.96 43 9 440 431 0.39 0.57 0.61
## 10% 20% 30% 40% 50% 60% 70% 80% 90% 99% 100%
## 87.00 102.00 113.00 124.00 134.00 147.00 160.00 176.00 195.00 241.03 440.00
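The percentile row above has the shape produced by base R's `quantile()` with deciles plus the 99th and 100th percentiles. A minimal sketch, using only the ten `total.sulfur.dioxide` values visible in the `str()` output (so the numbers will not match the full-data percentiles):

```r
# Ten total.sulfur.dioxide values copied from the str() output; an excerpt only.
tsd_excerpt <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Deciles from 10% to 90%, then the 99th and 100th percentiles.
probs <- c(seq(0.1, 0.9, by = 0.1), 0.99, 1)
tsd_quantiles <- quantile(tsd_excerpt, probs = probs)
print(tsd_quantiles)
```

Adding the 99% level next to the maximum is a quick way to see how far the largest value sits from the bulk of the data.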
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 0.99 0 0.99 0.99 0 0.99 1.04 0.05 0.98 9.78 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 3.19 0.15 3.18 3.18 0.15 2.72 3.82 1.1 0.46 0.53 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 4898 10.51 1.23 10.4 10.43 1.48 8 14.2 6.2 0.49 -0.7 0.02
Use a variety of plot types to get an indication of whether a given feature varies significantly with quality.
Omitting several features that appeared less interesting for brevity:
See above
residual.sugar - Interesting due to its long, thick tail compared to the other features, with some suggestion that it is in fact bi-modal. Some suggestion that higher quality wines have lower levels than average wines
alcohol - By far the clearest indicator of quality, with higher alcohol content indicating higher quality. Also, the distribution was much more platykurtic (kurtosis -0.70) than the other input variables, which mostly tended to be leptokurtic
(Suspect there is a relationship between high alcohol and low residual.sugar?)
chlorides - Looks like higher quality wines have lower chloride levels
density - Looks like higher quality wines have lower density
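The platykurtic/leptokurtic labels above refer to the sign of the excess kurtosis in the summary table. The sketch below computes skew and excess kurtosis in base R using the sample standard deviation, which I believe matches the default ("type 3") definition in `psych::describe`, though that is an assumption. The function names `moments_type3` and `classify_kurtosis` are hypothetical. The check uses the `X` index column, which is simply 1 to 4898, so it can be reproduced without the data file.

```r
# Skew and excess kurtosis using central moments divided by powers of sd(x).
# Assumed (not confirmed by the source) to match psych::describe's default.
moments_type3 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  s <- sd(x)
  c(skew     = (sum(d^3) / n) / s^3,
    kurtosis = (sum(d^4) / n) / s^4 - 3)
}

# Label a distribution by the sign of its excess kurtosis.
classify_kurtosis <- function(k, tol = 0.1) {
  if (k > tol) "leptokurtic" else if (k < -tol) "platykurtic" else "mesokurtic"
}

# The X column is the row index 1..4898, so its moments can be recomputed here.
x_moments <- moments_type3(1:4898)
print(round(x_moments, 2))

# Kurtosis values reported in the table above for alcohol and chlorides.
print(classify_kurtosis(-0.70))  # alcohol
print(classify_kurtosis(37.51))  # chlorides
```

A flat index column is about as platykurtic as real data gets, which makes it a convenient sanity check against the `X` row of the table.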
Not exactly, but I did create a variation of the output variable: quality.category.
Rather than have to consider every individual quality score (7 distinct values, 3 to 9, occur in the data), I instead used 3 categories (Low, Med, High) to simplify analysis.
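The two frequency tables earlier in the report suggest where the category boundaries fall: scores 3 and 4 sum to 183 (L), 5 and 6 sum to 3655 (M), and 7 to 9 sum to 1060 (H). The sketch below rebuilds a `quality` vector from the printed counts and bins it with `cut()`. The break points are inferred from those counts rather than taken from the original code.

```r
# Counts per quality score, copied from the frequency table in the report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Expand the counts back into one score per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Inferred bins: <= 4 is Low, 5 or 6 is Medium, >= 7 is High.
quality.category <- cut(quality,
                        breaks = c(-Inf, 4, 6, Inf),
                        labels = c("L", "M", "H"),
                        ordered_result = TRUE)

category_counts <- table(quality.category)
print(category_counts)
```

Using an ordered factor keeps L < M < H in plots and legends without extra relevelling.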
residual.sugar had a long, thick tail, so I performed a log10 transformation, which gave me a clearer idea of where the bulk of the data lay.
As well as a large peak at around 2, there was a shorter, fatter peak (roughly equal in area) at around 10.
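A minimal sketch of the transformation, using only the ten `residual.sugar` values visible in the `str()` output. On a log10 axis the values 2 and 10 sit at about 0.30 and 1, so peaks there are well separated even though the raw scale runs up to 65.8. The bin edges are an arbitrary choice for this excerpt.

```r
# Ten residual.sugar values copied from the str() output; an excerpt only.
rs_excerpt <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Log10 transform, then bin without drawing so the counts can be inspected.
rs_log10 <- log10(rs_excerpt)
rs_hist <- hist(rs_log10, breaks = seq(0, 1.5, by = 0.25), plot = FALSE)

# Report each bin in original units by undoing the transform on the bin edges.
bin_edges_raw <- round(10^rs_hist$breaks, 2)
print(data.frame(from  = head(bin_edges_raw, -1),
                 to    = tail(bin_edges_raw, -1),
                 count = rs_hist$counts))
```

If the plots were drawn with ggplot2, `scale_x_log10()` would apply the same transform to the axis while keeping tick labels in original units.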
Produce a bi-variate matrix to show the correlation/distribution between each pair of features to steer further analysis. Produce a matrix for all the data and then again for just the higher quality wines to see whether, and how, they differ.
This takes ages to produce, so I pre-prepared images (and commented out code):
I will use this output to choose which bi-variate and multivariate plots to produce.
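The original matrix code is not shown, so the following is only a sketch of the numeric half of such a matrix using base R's `cor()`. It uses three columns from the six rows printed by `head()`, which means the coefficients are illustrative and will not match the full-data values. The helper name `cor_matrix` is hypothetical.

```r
# Three columns from the six rows printed by head(); an excerpt only.
wines_excerpt <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
)

# Pairwise Pearson correlations for the chosen numeric columns.
cor_matrix <- function(df, cols = names(df)) {
  cor(df[, cols, drop = FALSE], method = "pearson")
}

cm_all <- cor_matrix(wines_excerpt)
print(round(cm_all, 3))

# For the high quality version, pass a row subset of the full data instead,
# e.g. cor_matrix(wines[wines$quality >= 7, ], cols), where `wines` is the
# full data frame (not available in this sketch).
```

Base R's `pairs()` draws the matching scatter-plot grid, and it is the plotting step rather than `cor()` that tends to be slow on thousands of rows.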
Scatter plots to compare individual features against quality.
Omitting several features for brevity:
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$volatile.acidity
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2215214 -0.1676307
## sample estimates:
## cor
## -0.194723
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$volatile.acidity
## S = 2.3434e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1965617
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$residual.sugar
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12524103 -0.06976101
## sample estimates:
## cor
## -0.09757683
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$residual.sugar
## S = 2.1191e+10, p-value = 8.822e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.08206979
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2365501 -0.1830039
## sample estimates:
## cor
## -0.2099344
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$chlorides
## S = 2.5743e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.3144885
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2017563 -0.1474524
## sample estimates:
## cor
## -0.1747372
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$total.sulfur.dioxide
## S = 2.3436e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1966803
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$density
## S = 2.6406e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.348351
##
## Pearson's product-moment correlation
##
## data: wines$quality and wines$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
##
## Spearman's rank correlation rho
##
## data: wines$quality and wines$alcohol
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4403692
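The printed test output can be cross-checked without the data. For Pearson, the coefficient follows from the t statistic as r = t / sqrt(t^2 + df), and the confidence interval comes from the Fisher z transform. For Spearman, rho follows from the S statistic as 1 - 6S / (n(n^2 - 1)). The sketch below applies these to the alcohol results above; the helper names are hypothetical, and small differences arise because the printed t and S are rounded.

```r
# Recover Pearson's r from the t statistic and its degrees of freedom.
r_from_t <- function(t, df) t / sqrt(t^2 + df)

# Confidence interval for r via the Fisher z transform.
fisher_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)
  se   <- 1 / sqrt(n - 3)
  crit <- qnorm(1 - (1 - level) / 2)
  tanh(z + c(-1, 1) * crit * se)
}

# Recover Spearman's rho from the S statistic (sum of squared rank differences).
rho_from_S <- function(S, n) 1 - 6 * S / (n * (n^2 - 1))

n <- 4898  # number of wines; df = n - 2 = 4896

# quality vs alcohol: t = 33.858 and S = 1.096e+10 as printed above.
r_alcohol   <- r_from_t(33.858, n - 2)
ci_alcohol  <- fisher_ci(r_alcohol, n)
rho_alcohol <- rho_from_S(1.096e10, n)

print(round(c(r = r_alcohol, lower = ci_alcohol[1],
              upper = ci_alcohol[2], rho = rho_alcohol), 4))
```

The same helpers applied to chlorides (S = 2.5743e+10) give a rho close to the printed -0.314, noticeably stronger than its Pearson value of -0.21, which hints at a monotonic but non-linear relationship.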
(more on this in the Multivariate section)
Confirmed a clear positive correlation between alcohol and quality (Pearson 0.436), the strongest of any feature.
Surprised by the very small correlation between residual.sugar and quality (Pearson -0.098). Curiously, it looks like (on average) levels of residual.sugar start low for low quality wines, rise for medium quality wines and dip again as we move into high quality wines. What, if anything, does this suggest?
(I’m ignoring density because I think it is a direct consequence of levels of the above)
Strong +ve correlation between total.sulfur.dioxide and density, and a corresponding strong -ve correlation between total.sulfur.dioxide and alcohol.
Surprised that, whilst we observe a clear -ve correlation between fixed.acidity and pH (Pearson -0.426, as you might expect), there is little to no relationship between volatile.acidity and pH.
Strongest relationship between a feature and the output variable (quality) was for ‘alcohol’ with (Pearson) correlation of 0.436
Overall the strongest correlation was between residual.sugar and density with (Pearson) correlation of 0.839. Closely followed by alcohol and density (-0.78).
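To compare the feature-versus-quality results at a glance, the Pearson estimates printed in the test output above can be ranked by absolute value. This sketch only reorders numbers already reported; the vector name is hypothetical.

```r
# Pearson correlations with quality, copied from the cor.test output above.
pearson_with_quality <- c(
  volatile.acidity     = -0.194723,
  residual.sugar       = -0.09757683,
  chlorides            = -0.2099344,
  total.sulfur.dioxide = -0.1747372,
  density              = -0.3071233,
  alcohol              =  0.4355747
)

# Rank by strength of association, ignoring sign.
ranked <- pearson_with_quality[order(abs(pearson_with_quality),
                                     decreasing = TRUE)]
print(round(ranked, 3))
```

By magnitude density comes second after alcohol, although the analysis sets density aside as a consequence of other features, which leaves chlorides as the next candidate.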
Look for correlations between input variables, start with those that seem to have a significant impact upon quality and indeed split by quality (to provide a third and thus multivariate plot)
##
## Pearson's product-moment correlation
##
## data: wines$chlorides and wines$alcohol
## t = -27.016, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3843183 -0.3355673
## sample estimates:
## cor
## -0.3601887
##
## Spearman's rank correlation rho
##
## data: wines$chlorides and wines$alcohol
## S = 3.0763e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5708064
##
## Pearson's product-moment correlation
##
## data: wines$density and wines$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
##
## Spearman's rank correlation rho
##
## data: wines$density and wines$alcohol
## S = 3.568e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8218551
##
## Pearson's product-moment correlation
##
## data: wines$residual.sugar and wines$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
##
## Spearman's rank correlation rho
##
## data: wines$residual.sugar and wines$alcohol
## S = 2.8304e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.4452574
##
## Pearson's product-moment correlation
##
## data: highwines$volatile.acidity and highwines$alcohol
## t = 19.179, df = 1058, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4618272 0.5512684
## sample estimates:
## cor
## 0.5079155
##
## Pearson's product-moment correlation
##
## data: wines$density and wines$chlorides
## t = 18.624, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2308679 0.2831779
## sample estimates:
## cor
## 0.2572113
##
## Spearman's rank correlation rho
##
## data: wines$density and wines$chlorides
## S = 9629500000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5083018
An article discussing the relationship between alcohol and residual.sugar also suggests you need high acidity with high sugar
##
## Pearson's product-moment correlation
##
## data: wines$residual.sugar and wines$fixed.acidity
## t = 6.2537, df = 4896, p-value = 4.348e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06116674 0.11673612
## sample estimates:
## cor
## 0.0890207
##
## Spearman's rank correlation rho
##
## data: wines$residual.sugar and wines$fixed.acidity
## S = 1.7494e+10, p-value = 6.955e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1067249
##
## Pearson's product-moment correlation
##
## data: highwines$residual.sugar and highwines$fixed.acidity
## t = 8.3967, df = 1058, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1926383 0.3055648
## sample estimates:
## cor
## 0.2499514
residual.sugar vs fixed.acidity
[residual.sugar vs volatile.acidity]
For completeness I also checked residual.sugar against volatile.acidity; this showed virtually no correlation for any quality level.
##
## Pearson's product-moment correlation
##
## data: wines$pH and wines$fixed.acidity
## t = -32.934, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4485154 -0.4026542
## sample estimates:
## cor
## -0.4258583
##
## Pearson's product-moment correlation
##
## data: wines$residual.sugar and wines$total.sulfur.dioxide
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3776791 0.4246712
## sample estimates:
## cor
## 0.4014393
During the bi-variate section it looked like:
similarly, Big cluster of high quality wines with ‘low chlorides’
similarly, Smaller cluster of high quality wines with ‘high chlorides’
The multivariate plot between chlorides and alcohol reinforces this suspicion, it looks like we see
We appear to be seeing similar behaviour for these two:
It’s less clear whether medium (and low) quality wines fall between these clusters
(please note: the ‘=’ is misleading because of course there are also lower quality wines that follow the same pattern)
The strong positive correlation between volatile.acidity and alcohol for higher quality wines was surprising given there was virtually no correlation for lower and medium quality wines
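The `highwines` test output has df = 1058, i.e. 1060 rows, which matches the H count, so `highwines` is presumably the subset of wines in the High category. A per-category comparison like the one described can be sketched with `split()` and `cor()`. The helper name `cor_by_group` is hypothetical, and the toy values below are HYPOTHETICAL numbers chosen so the result can be checked by hand; they are not wine measurements.

```r
# Pearson correlation of x and y within each level of a grouping factor.
# drop = TRUE skips factor levels that have no rows.
cor_by_group <- function(x, y, group) {
  idx <- split(seq_along(x), group, drop = TRUE)
  sapply(idx, function(i) cor(x[i], y[i]))
}

# HYPOTHETICAL toy data: group "M" has no linear trend, group "H" is
# perfectly linear. These are not values from the wine data set.
toy <- data.frame(
  x     = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
  y     = c(1, -1, -1, 1, 2, 4, 6, 8, 10),
  group = factor(rep(c("M", "H"), times = c(4, 5)), levels = c("L", "M", "H"))
)

by_group <- cor_by_group(toy$x, toy$y, toy$group)
print(round(by_group, 3))
```

On the real data the call would be along the lines of `cor_by_group(wines$volatile.acidity, wines$alcohol, wines$quality.category)`, assuming those column names.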
NO
Alcohol is by far the single biggest contributor to the quality of white wine (in this sample). As a crude measure it has a correlation (Pearson) value of 0.436, over twice the magnitude of chlorides (-0.21), which is the next most significant feature once density (-0.307) is set aside as a consequence of the other features.
The density plot shows quite clearly the marked increase in quality as you increase the level of alcohol, with a sweet spot between around 11% and 13%.
The distribution of low and medium quality wines is ‘quite similar’ whereas (as expected) the distribution of high quality wines is noticeably right shifted (higher alcohol).
It’s also worth noting that there is a smaller peak of higher quality wines with lower alcohol content (around 9%). So, we have:
A similar pattern could be observed for certain other features. For example:
We’ll look at how these contribute together towards the quality of wine in the following plots
Here we plot chlorides against alcohol, facet wrapped by quality, to see whether they reinforce each other, and in particular whether the high quality clusters (large and small) outlined earlier still exist when we combine features.
Although it is difficult to be certain, particularly because of the uneven distribution (there are many more medium quality wines), it looks like the two high quality clusters hypothesis just about still holds.
There does appear to be a large clustering of blue (high quality) top right with low chlorides and high alcohol, though it’s not very dense. Similarly there appears to be a smaller cluster bottom right (low alcohol and high chlorides).
Moreover (with a bit of a squint) it looks like the critical mass of low/medium quality wines sit somewhere between the two clusters.
An analogous pattern could be observed if we also plotted the following:
There are two recipes for a good white wine:
In actual fact there are plenty of poorer quality wines that follow these recipes but you will increase your probability of having a high quality wine.
When plotting volatile.acidity against alcohol split by quality we see a marked difference in distribution for high quality wines compared to the rest.
We see a fairly strong positive correlation (Pearson 0.5) for high quality wines vs virtually nothing for the rest.
So, getting the balance of volatile.acidity to alcohol level correct might be a further worthwhile consideration when producing/predicting/evaluating white wine.
All the analysis suffers from one big flaw which is my fundamental ignorance of chemistry (subject matter expertise - missing), meaning that observations that seem interesting to me might well be obvious/inevitable and vice versa.
Also this is a relatively small data set and thus subject to large errors and wrong conclusions.
The vast majority of the data had mid-range quality values (5 or 6), which made comparison difficult. In practice I suspect there was more distance between those wines than a difference of just one point would suggest. A finer scale, or the individual ratings per reviewer per wine rather than an aggregated score per wine, might have enabled more interesting analysis.
I found it difficult to abstract repeated code into functions, and eventually abandoned the attempt. This was largely because for most plots I had to forensically/manually calculate bounds.
Moving forwards it would (perhaps) be interesting to apply a clustering model (e.g. K-Means) to the high quality wines to determine whether the two categories do really exist.
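As a rough sketch of that idea, base R's `kmeans()` can be run on standardised features. The eight rows below are HYPOTHETICAL values invented to form two obvious groups (high alcohol with low chlorides, and low alcohol with high chlorides); real high quality wines would be far less cleanly separated, and choosing two centres simply encodes the hypothesis being tested.

```r
set.seed(42)  # k-means starts from random centres

# HYPOTHETICAL toy rows, not taken from the wine data set.
toy <- data.frame(
  alcohol   = c(12.0, 12.5, 12.8, 13.0, 9.0, 9.2, 9.1, 8.9),
  chlorides = c(0.030, 0.032, 0.028, 0.031, 0.050, 0.052, 0.049, 0.051)
)

# Standardise first so alcohol's larger numeric range does not dominate.
km <- kmeans(scale(toy), centers = 2, nstart = 10)

# Cluster means in original units, for interpretation.
cluster_means <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
print(cluster_means)
```

On real data it would be worth comparing the within-cluster sum of squares (`km$tot.withinss`) across several values of `centers` before concluding that exactly two groups exist.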
Additionally we could look to train a predictive model (e.g. logistic regression or neural network) to predict the quality of wine (perhaps explicitly favouring the selected features)
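A minimal sketch of the logistic regression option using base R's `glm()`. It assumes quality is collapsed to a binary outcome (high versus not high), which is a simplification of the three categories used in this report, and it uses HYPOTHETICAL toy rows rather than the wine data, so the fitted coefficients carry no meaning beyond illustrating the calls.

```r
# HYPOTHETICAL toy rows: alcohol level and a 0/1 flag for "high quality".
toy <- data.frame(
  alcohol = c(9, 9.5, 10, 10.5, 11, 11.5, 12, 12.5, 13, 13.5),
  high    = c(0, 0, 1, 0, 0, 1, 0, 1, 1, 1)
)

# Logistic regression: log-odds of being high quality as a linear
# function of alcohol.
fit <- glm(high ~ alcohol, data = toy, family = binomial)
print(coef(fit))

# Predicted probability of "high" at two alcohol levels.
p_high <- predict(fit, newdata = data.frame(alcohol = c(9, 13)),
                  type = "response")
print(round(p_high, 3))
```

Further selected features (for example chlorides or volatile.acidity) would enter as extra terms on the right-hand side of the formula, and a held-out test split would be needed to judge predictive value.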